Automatic creation of WordNets from parallel corpora

نویسندگان

  • Antoni Oliver
  • Salvador Climent
چکیده

In this paper we present the evaluation results for the creation of WordNets for five languages (Spanish, French, German, Italian and Portuguese) using an approach based on parallel corpora. We have used three very large parallel corpora for our experiments: DGT-TM, EMEA and ECB. The English part of each corpus is semantically tagged using Freeling and UKB. After this step, the process of WordNet creation is converted into a word alignment problem, where we want to align WordNet synsets in the English part of the corpus with lemmata on the target language part of the corpus. The word alignment algorithm used in these experiments is a simple most frequent translation algorithm implemented into the WN-Toolkit. The obtained precision values are quite satisfactory, but the overall number of extracted synset-variant pairs is too low, leading into very poor recall values. In the conclusions, the use of more advanced word alignment algorithms, such as Giza++, Fast Align or Berkeley aligner is suggested.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

WN-Toolkit: Automatic generation of WordNets following the expand model

This paper presents a set of methodologies and algorithms to create WordNets following the expand model. We explore dictionary and BabelNet based strategies, as well as methodologies based on the use of parallel corpora. Evaluation results for six languages are presented: Catalan, Spanish, French, German, Italian and Portuguese. Along with the methodologies and evaluation we present an implemen...

متن کامل

Fine-Grained Word Sense Disambiguation Based on Parallel Corpora, Word Alignment, Word Clustering and Aligned Wordnets

The paper presents a method for word sense disambiguation based on parallel corpora. The method exploits recent advances in word alignment and word clustering based on automatic extraction of translation equivalents and being supported by available aligned wordnets for the languages in the corpus. The wordnets are aligned to the Princeton Wordnet, according to the principles established by Euro...

متن کامل

Word Sense Disambiguation: A Case Study on the Granularity of Sense Distinctions

The paper presents a method for word sense disambiguation (WSD) based on parallel corpora. The method exploits recent advances in word alignment and word clustering based on automatic extraction of translation equivalents and is supported by a lexical ontology made of aligned wordnets for the languages in the corpora. The wordnets are aligned to the Princeton Wordnet, according to the principle...

متن کامل

Leveraging Parallel Corpora and Existing Wordnets for Automatic Construction of the Slovene Wordnet

The paper reports on a series of experiments conducted in order to test the feasibility of automatically generating synsets for Slovene wordnet. The resources used were the multilingual parallel corpus of George Orwell’s Nineteen Eighty-Four and wordnets for several languages. First, the corpus was word-aligned to obtain multilingual lexicons and then these lexicons were compared to the wordnet...

متن کامل

Methodology and evaluation of the Galician WordNet expansion with the WN-Toolkit

In this paper the methodology and a detailed evaluation of the results of the expansion of the Galician WordNet using the WN-Toolkit are presented. This toolkit allows the creation and expansion of wordnets using the expand model. In our experiments we have used methodologies based on dictionaries and parallel corpora. The evaluation of the results has been performed both in an automatic and in...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2014